
Text Similarity Detection in Web Scraping Data Cleaning and Fine-Tuning Models (Part 2)

Advance web scraping data cleaning with fine-tuned text similarity models. Build Sentence-BERT/Siamese encoders in PyTorch; apply contrastive learning, hard-negative mining, and domain-specific augmentation. Evaluate with cosine AUC, MAP, and clustering purity. Deploy embeddings to Elasticsearch or FAISS for vector search, near-duplicate detection, and entity consolidation. Includes Python code and a reproducible pipeline.

2025-11-11

In the previous article, we introduced text similarity detection techniques based on Levenshtein distance and TF-IDF with cosine similarity.

In this article, we move forward and explore semantic-level text similarity, which plays a critical role in web scraping data cleaning, deduplication, and downstream model fine-tuning.


Sentence-Transformers for Semantic Text Similarity

Sentence-transformers map sentences, paragraphs, or even long documents into a high-dimensional vector space.
In this vector space, semantically similar texts stay closer together, which allows systems to compute similarity efficiently using vector operations such as cosine similarity.

Sentence-transformers build on top of pre-trained language models like BERT and MiniLM. The framework optimizes these models specifically for sentence-level semantic understanding.


How Sentence-Transformers Encode Text

1. Text Encoding with Pre-trained Language Models

Sentence-transformers reuse language models that have already been pre-trained on massive corpora.
These base models understand vocabulary, grammar, and general semantics before any task-specific training begins.

The encoding process works as follows: the model tokenizes the input text, passes the tokens through the pre-trained transformer encoder, and produces a contextual vector for each token; a pooling step (described in the next section) then combines these token vectors into a single fixed-length sentence embedding.

For example, the sentence:

“I love natural language processing”

produces a vector with 384 or 768 dimensions, depending on the selected model.
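As a quick check, the snippet below encodes that sentence with the publicly available all-MiniLM-L6-v2 model (a minimal sketch; any sentence-transformers model works the same way) and prints the embedding size.

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("I love natural language processing")
print(embedding.shape)  # (384,)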


2. Sentence-Level Semantic Pooling

Pre-trained models generate vectors for each token, not for the entire sentence.
Sentence-transformers combine token vectors into a single sentence embedding using pooling strategies.

Common pooling strategies include:

- Mean pooling: average all token vectors, typically weighted by the attention mask so padding is ignored
- Max pooling: take the element-wise maximum across token vectors
- CLS pooling: use the vector of the special [CLS] token as the sentence representation

Mean pooling remains the most widely used approach in production systems.
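To make the pooling step concrete, the sketch below computes a mean-pooled sentence embedding directly with the Hugging Face transformers library (an illustrative re-implementation of what sentence-transformers does internally, not code from this article's pipeline).

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(["I love natural language processing"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling: zero out padding positions, then average over the sequence length.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 384])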


3. Fine-Tuning for Semantic Relevance

Sentence-transformers improve semantic alignment through fine-tuning on labeled sentence pairs.

During training, the system:

- encodes each sentence of a labeled pair into an embedding
- computes the similarity between the two embeddings
- compares the predicted similarity with the annotated label
- updates the model weights to shrink the gap

Common optimization strategies include cosine similarity loss, contrastive loss, triplet loss, and multiple-negatives ranking loss.

After fine-tuning, the model handles tasks such as synonym detection, paraphrase identification, and semantic equivalence more accurately.


4. Similarity Calculation

After generating embeddings, the system calculates similarity using cosine similarity or Euclidean distance.

A cosine similarity score closer to 1 indicates stronger semantic similarity.
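For reference, cosine similarity can be computed directly from two embedding vectors; the short sketch below does this with NumPy and is equivalent to what util.cos_sim returns in the examples later in this article (the toy vectors are illustrative).

import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector norms.
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(a, b))  # close to 1.0 when the vectors point in similar directions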


Installing Sentence-Transformers

Install the library with pip:

pip install sentence-transformers

The first execution downloads the selected model automatically from Hugging Face.


Basic Example: Semantic Similarity Detection

Test Sentences

text_a = "Artificial intelligence is transforming modern society through automation and data analysis."
text_b = "Machine learning algorithms are changing contemporary culture by automating processes and analyzing information."
text_c = "Climate change affects global weather patterns and requires immediate environmental action."

Similarity Calculation Code

from sentence_transformers import SentenceTransformer, util

MODEL_PATH = "./local-models/all-MiniLM-L6-v2"

def calculate_similarity(text1, text2, model):
    # Encode both texts and return their cosine similarity as a float.
    embedding1 = model.encode(text1, convert_to_tensor=True)
    embedding2 = model.encode(text2, convert_to_tensor=True)
    return util.cos_sim(embedding1, embedding2).item()

def main():
    model = SentenceTransformer(MODEL_PATH)

    sim_ab = calculate_similarity(text_a, text_b, model)
    sim_ac = calculate_similarity(text_a, text_c, model)

    print(f"Similarity A–B: {sim_ab:.4f}")
    print(f"Similarity A–C: {sim_ac:.4f}")

if __name__ == "__main__":
    main()

Output Interpretation

Similarity A–B: 0.6630
Similarity A–C: 0.1917

The system correctly identifies that A and B share moderate semantic similarity, while A and C are largely unrelated.
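In a scraping pipeline, these scores usually feed a threshold-based decision. The sketch below flags candidate near-duplicates among scraped snippets; the 0.8 threshold and the sample records are illustrative assumptions, and the threshold should be tuned on labeled data from your own domain.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

SIMILARITY_THRESHOLD = 0.8  # assumed value; tune per domain

def find_near_duplicates(records, model):
    # Compare every pair of records and keep those above the threshold.
    embeddings = model.encode(records, convert_to_tensor=True)
    duplicates = []
    for i, j in combinations(range(len(records)), 2):
        score = util.cos_sim(embeddings[i], embeddings[j]).item()
        if score >= SIMILARITY_THRESHOLD:
            duplicates.append((records[i], records[j], score))
    return duplicates

records = [
    "Smartphone with 128GB storage and a 6.1-inch display",
    "6.1-inch phone with 128 GB of storage",
    "Wireless noise-cancelling headphones",
]
print(find_near_duplicates(records, SentenceTransformer("all-MiniLM-L6-v2")))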


Why Fine-Tune Sentence-Transformers?

General-purpose models learn broad semantic rules, but domain-specific tasks require specialization.

Fine-tuning improves performance in scenarios such as:

- near-duplicate detection across scraped pages that paraphrase the same content
- entity consolidation, where records describing the same product, company, or person must be merged
- domains with specialized vocabulary, such as e-commerce listings or legal and medical text, that general models handle poorly

Empirical results often show 10%–30% performance gains after fine-tuning, though the exact improvement depends on the domain and the quality of the labeled pairs.


Fine-Tuning Workflow Overview

1. Data Preparation

Prepare labeled sentence pairs with similarity scores between 0 and 1.
High-quality annotations matter more than raw volume.
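Beyond manually annotated pairs, hard-negative mining (mentioned in the summary above) is a common way to strengthen the training set: for each sentence, the closest non-matching candidates are added as negatives so the model learns to separate near-misses. The helper below is an illustrative sketch; mine_hard_negatives and its parameters are not part of any library API.

from sentence_transformers import SentenceTransformer, util

def mine_hard_negatives(sentences, positive_pairs, model, top_k=2):
    # Score every sentence against every other sentence with the base model.
    embeddings = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(embeddings, embeddings).cpu().tolist()
    positives = {frozenset(pair) for pair in positive_pairs}
    hard_negatives = []
    for i, sentence in enumerate(sentences):
        # Rank the other sentences by similarity and keep the closest non-positives.
        ranked = sorted(range(len(sentences)), key=lambda j: scores[i][j], reverse=True)
        picked = 0
        for j in ranked:
            if j == i or frozenset((sentence, sentences[j])) in positives:
                continue
            hard_negatives.append({"sentence1": sentence, "sentence2": sentences[j], "score": 0.0})
            picked += 1
            if picked == top_k:
                break
    return hard_negatives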


2. Base Model Selection

Recommended options:

- all-MiniLM-L6-v2: fast, 384-dimensional embeddings, a strong default for English text
- all-mpnet-base-v2: slower, 768-dimensional embeddings, higher accuracy
- paraphrase-multilingual-MiniLM-L12-v2: suited to multilingual or mixed-language scraped data


3. Loss Function Selection

Choose the loss to match the shape of your labels:

- CosineSimilarityLoss: pairs annotated with a continuous score between 0 and 1
- ContrastiveLoss: pairs labeled as similar or dissimilar
- TripletLoss: anchor / positive / negative triplets
- MultipleNegativesRankingLoss: only positive pairs are available; other sentences in the batch act as negatives
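When only confirmed positive pairs are available, which is common with scraped data where duplicates are easy to collect but graded scores are not, MultipleNegativesRankingLoss is a practical choice. The sketch below shows that setup; the two example pairs are illustrative.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Positive pairs only; every other sentence in the batch acts as a negative.
train_examples = [
    InputExample(texts=["Smartphone with 128GB storage", "128 GB smartphone"]),
    InputExample(texts=["Wireless noise-cancelling headphones", "Bluetooth headphones with ANC"]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)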

4. Training Configuration

Typical starting points:

- batch size: 16–32, reduced if GPU memory is limited
- epochs: 1–4 for larger datasets, more for very small ones
- warmup steps: roughly 10% of the total training steps
- learning rate: the library default of 2e-5 usually works well

Fine-Tuning Example Code

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import pandas as pd

# Labeled sentence pairs with similarity scores between 0 and 1.
data = [
    {"sentence1": "Artificial intelligence is transforming healthcare", "sentence2": "AI is revolutionizing medical services", "score": 0.91},
    {"sentence1": "Natural language processing enables chatbots", "sentence2": "NLP powers conversational AI systems", "score": 0.93},
]

df = pd.DataFrame(data)

# Wrap each pair in an InputExample so the loss function can read the texts and label.
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for _, row in df.iterrows()
]

# The DataLoader reshuffles the examples at every epoch.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("all-MiniLM-L6-v2")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    warmup_steps=10,
    output_path="./fine-tuned-all-MiniLM-L6-v2"
)
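After training, the model saved at output_path loads exactly like a pre-trained one; a minimal usage sketch, assuming the fine-tuning script above has already run:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("./fine-tuned-all-MiniLM-L6-v2")

emb1 = model.encode("Artificial intelligence is transforming healthcare", convert_to_tensor=True)
emb2 = model.encode("AI is revolutionizing medical services", convert_to_tensor=True)
print(util.cos_sim(emb1, emb2).item())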

Final Notes